Skip to content

feat: specify retry configuration overrides per subgraph#2172

Open
SkArchon wants to merge 32 commits intomainfrom
milinda/retry-per-subgraphs
Open

feat: specify retry configuration overrides per subgraph#2172
SkArchon wants to merge 32 commits intomainfrom
milinda/retry-per-subgraphs

Conversation

@SkArchon
Copy link
Copy Markdown
Contributor

@SkArchon SkArchon commented Aug 27, 2025

Fixes ENG-7966

This PR adds the option for users to set retry settings per subgraph. If a subgraph retry is specified for all, it will be used by default, with any subgraph overrides either disabling or changing the settings per subgraph. Example configuration

traffic_shaping:
  all:
    retry:
      max_attempts: 5
  subgraphs:
    employee:
      retry:
        interval: 5s
        max_attempts: 7

This PR also adds a fix for upgrade requests for circuit breakers, as the engine hooks context key with the subgraph name is not set on upgrade requests.

Summary by CodeRabbit

  • New Features

    • Per-subgraph retry controls with global defaults, expression-driven policies, Retry-After handling, and retry callbacks; WebSocket/upgrade paths now exercise retry behavior.
  • Refactor

    • Centralized retry manager and expression manager; retry logic unified across transport, executor, circuit breaker and tracing with a builder-style configuration API and per-subgraph awareness.
  • Tests

    • Expanded coverage for per-subgraph retries, expression evaluation, Retry-After variants, retry counting, panic/error propagation, and circuit-breaker upgrade flows (one test duplicated).

Checklist

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Aug 27, 2025

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Replace primitive multi-argument subgraph-retry configuration with a structured, per-subgraph, expression-driven system: add SubgraphRetryOptions/NewSubgraphRetryOptions, a RetryExpressionManager, and an internal retry Manager; wire them into transports, graph server, router config, circuit breaker, traceclient, and tests.

Changes

Cohort / File(s) Summary of changes
Tests: migrate to typed retry options
router-tests/error_handling_test.go, router-tests/panic_test.go, router-tests/structured_logging_test.go, router-tests/telemetry/telemetry_test.go
Replace multi-arg WithSubgraphRetryOptions(...) calls with core.NewSubgraphRetryOptions(config.TrafficShapingRules{...}) and WithSubgraphRetryOptions(opts); use structured TrafficShapingRules (e.g., BackoffJitterRetry.Enabled=false).
Retry tests: builder usage & per-subgraph coverage
router-tests/retry_test.go
Migrate tests to construct SubgraphRetryOptions via NewSubgraphRetryOptions(...), set OnRetryFunc, pass to WithSubgraphRetryOptions(opts); add TestRetryPerSubgraph exercising global vs per-subgraph rules, invalid-algorithm handling, per-subgraph counters, and websocket subscription cases.
WebSocket & circuit-breaker tests
router-tests/websocket_test.go, router-tests/circuit_breaker_test.go
Inject typed SubgraphRetryOptions into websocket tests; add circuit-breaker WebSocket upgrade-path test (uses github.com/gorilla/websocket) — duplicated test copies present in diff.
Router public API & config
router/core/router.go, router/core/router_config.go, router/core/supervisor_instance.go
Add exported SubgraphRetryOptions (All, SubgraphMap, OnRetryFunc) and IsEnabled(); replace old multi-arg WithSubgraphRetryOptions with WithSubgraphRetryOptions(*SubgraphRetryOptions); add NewSubgraphRetryOptions(cfg config.TrafficShapingRules); Config now stores *SubgraphRetryOptions.
Graph server: manager initialization & injection
router/core/graph_server.go
Create RetryExpressionManager, build retry function with BuildRetryFunction, instantiate and initialize retrytransport.Manager, store on server and pass RetryManager into ExecutorConfigurationBuilder.
Retry builder & tests: expression-manager-driven
router/core/retry_builder.go, router/core/retry_builder_test.go
Add BuildRetryFunction(manager *expr.RetryExpressionManager) returning a retry closure that accepts an expression string; add mutation guard and EOF default-retry logic; update tests to use expression manager and explicit expression parameter.
Transport core: API and wiring changes
router/core/transport.go
Replace retry-options parameters with retryManager *retrytransport.Manager; update NewCustomTransport, TransportFactory, and TransportOptions to accept/store a manager and pass it to NewRetryHTTPTransport.
Expression engine: multi-expression manager
router/internal/expr/retry_expression.go, router/internal/expr/retry_expression_test.go
Replace single compiled program with an expression map; add NewRetryExpressionManager() and AddExpression(exprString) error; change ShouldRetry(ctx, expressionString) signature and update tests to add/evaluate named expressions.
Retry manager: per-subgraph manager implementation
router/internal/retrytransport/manager.go
Add Manager type with per-subgraph RetryOptions, exprManager, ShouldRetry delegate and OnRetry hook; provide NewManager, Initialize, GetSubgraphOptions, IsEnabled, and Retry methods; validate algorithms and register expressions.
Retry HTTP transport & tests: manager-driven runtime logic
router/internal/retrytransport/retry_transport.go, router/internal/retrytransport/retry_transport_test.go
Switch transport to use *Manager for per-subgraph options; add getActiveSubgraph resolution; implement per-subgraph retry loops bounded by MaxRetryCount; add Retry-After parsing/handling and OnRetry hook integration; update constructor to NewRetryHTTPTransport(roundTripper, getRequestLogger, retryManager, getActiveSubgraph) and adapt tests.
Circuit, traceclient, trace-injection callbacks
router/internal/circuit/breaker.go, router/internal/traceclient/traceclient.go
Add getActiveSubgraphName and separate getExprContext callbacks; update constructors and RoundTrip/metrics to use callbacks instead of reading subgraph from request context.
Engine loader & context keys cleanup
router/core/engine_loader_hooks.go, router/internal/context/keys.go
Remove injecting current subgraph into request context in OnLoad; delete CurrentSubgraphContextKey type and remove context-key-based subgraph propagation.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "feat: specify retry configuration overrides per subgraph" is concise and accurately summarizes the PR’s primary change—adding per-subgraph retry override support—using a conventional "feat:" prefix and clear wording that a teammate scanning history will understand.

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Aug 27, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-22bdbb2c7405910b5fe70a3516beefeca38538d5-nonroot

@SkArchon SkArchon changed the base branch from main to dustin/eng-7974-allow-to-configure-retry-condition August 27, 2025 15:14
Base automatically changed from dustin/eng-7974-allow-to-configure-retry-condition to main August 28, 2025 11:57
@SkArchon SkArchon force-pushed the milinda/retry-per-subgraphs branch from f6f2adc to ef2a9a6 Compare September 1, 2025 11:22
@SkArchon SkArchon force-pushed the milinda/retry-per-subgraphs branch from 8a7da2f to d86e5cc Compare September 1, 2025 12:53
@SkArchon SkArchon marked this pull request as ready for review September 1, 2025 13:01
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/internal/retrytransport/retry_transport.go (1)

142-144: Honor context cancellation during backoff sleep.

Abort retries promptly if the request context is canceled.

-		// Wait for the specified duration
-		time.Sleep(sleepDuration)
+		// Wait for the specified duration or abort if canceled
+		select {
+		case <-req.Context().Done():
+			return resp, req.Context().Err()
+		case <-time.After(sleepDuration):
+		}
🧹 Nitpick comments (25)
router-tests/structured_logging_test.go (1)

712-712: Consistent migration to object-based retry options in tests

All updated WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(... Enabled: false ...)) calls look correct and keep logs deterministic by avoiding subgraph retries.

If you find yourself disabling retries across many tests, consider a small helper in testenv (e.g., testenv.DisableSubgraphRetriesOption()) to reduce repetition and ease future changes.

Also applies to: 832-832, 964-964, 1100-1100, 2214-2214

router-tests/telemetry/telemetry_test.go (2)

8695-8700: Per-subgraph retry builder usage looks correct; optionally assert “no retries” explicitly

Switching to core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(...)) with Enabled: false matches the new API and aligns with our prior learning that algorithms shouldn’t be validated when retries are disabled. If you want a stronger assertion, consider wiring an OnRetryFunc (if exposed on SubgraphRetryOptions) to increment a counter and assert it remains zero during this test.


9639-9642: Check all datapoints, not just the first, for absence of the subgraph attribute

Guard against future exporter changes that may add more datapoints by asserting across the full set.

-					data2 := httpRequestsMetric.Data.(metricdata.Sum[int64])
-					atts := data2.DataPoints[0].Attributes
-					_, ok := atts.Value(attribute.Key(key))
-					require.False(t, ok)
+					data2 := httpRequestsMetric.Data.(metricdata.Sum[int64])
+					for _, dp := range data2.DataPoints {
+						_, ok := dp.Attributes.Value(attribute.Key(key))
+						require.False(t, ok)
+					}
router-tests/error_handling_test.go (1)

1383-1387: Good migration to manager-based retry config; consider a helper to DRY tests

The switch to core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(...)) with Enabled:false is correct and matches our prior learning that disabled retries shouldn’t validate algorithm/fields. You repeat the same disabled config 4x; a tiny helper keeps tests concise.

Apply this to replace the repeated blocks:

-                core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(config.TrafficShapingRules{
-                    All: config.GlobalSubgraphRequestRule{
-                        BackoffJitterRetry: config.BackoffJitterRetry{Enabled: false},
-                    },
-                })),
+                disabledSubgraphRetries(),

Add this helper near the top of the file (outside the selected ranges):

// test helper
func disabledSubgraphRetries() core.Option {
	return core.WithSubgraphRetryOptions(core.NewSubgraphRetryOptions(config.TrafficShapingRules{
		All: config.GlobalSubgraphRequestRule{
			BackoffJitterRetry: config.BackoffJitterRetry{Enabled: false},
		},
	}))
}

Also applies to: 1419-1423, 1455-1459, 1491-1495

router/core/graph_server.go (1)

266-287: router/core/graph_server.go: guard nil retryOptions and wrap Initialize error

-    if s.retryOptions.IsEnabled() {
+    if s.retryOptions != nil && s.retryOptions.IsEnabled() {
         retryExprManager := expr.NewRetryExpressionManager()
         // …
         err = retryManager.Initialize(
             s.retryOptions.All,
             s.retryOptions.SubgraphMap,
             routerConfig.GetSubgraphs(),
         )
-        if err != nil {
-            return nil, err
-        }
+        if err != nil {
+            return nil, fmt.Errorf("failed to initialize retry manager: %w", err)
+        }
         s.retryManager = retryManager
     }
router/internal/expr/retry_expression.go (3)

16-23: Constructor is fine; optional: precompile default expression

As a small UX improvement, consider precompiling the default expression during construction so first evaluations don’t depend on prior AddExpression calls.

Example (if you decide to change signature elsewhere to allow error return):

m := &RetryExpressionManager{expressionMap: make(map[string]*vm.Program)}
_ = m.AddExpression("") // ignore error; Initialize can still validate real strings
return m

25-50: Normalize expression key or fix the comment

Comment says “normalized” but no normalization occurs. Either trim/normalize or update the comment.

 import (
     "fmt"
     "reflect"
+    "strings"

     "github.com/expr-lang/expr"
     "github.com/expr-lang/expr/vm"
 )
@@
-func (c *RetryExpressionManager) AddExpression(exprString string) error {
-    expression := exprString
+func (c *RetryExpressionManager) AddExpression(exprString string) error {
+    expression := strings.TrimSpace(exprString)
@@
-    // Use the normalized expression string as the key for deduplication
+    // Use the trimmed expression string as the key for deduplication
     c.expressionMap[expression] = program
     return nil
 }

53-80: Consider lazy compilation on cache miss instead of silently returning false

Fail-closed is safe, but lazily compiling unknown expressions can reduce surprises if Initialize missed one.

 func (m *RetryExpressionManager) ShouldRetry(ctx RetryContext, expressionString string) (bool, error) {
     if m == nil {
         return false, nil
     }

-    expression := expressionString
+    expression := expressionString
     if expression == "" {
         expression = defaultRetryExpression
     }

     program, ok := m.expressionMap[expression]
     if !ok {
-        // If the expression wasn't pre-compiled, do not retry by default
-        return false, nil
+        // Lazily compile the expression once
+        if err := m.AddExpression(expression); err != nil {
+            return false, err
+        }
+        program = m.expressionMap[expression]
     }

If AddExpression can be invoked concurrently in your environment, add an RWMutex around expressionMap accesses.

Would you like me to add a minimal, lock-guarded version?

router/internal/expr/retry_expression_test.go (2)

141-146: Strengthen the empty-expression test

Consider asserting runtime behavior as well (e.g., ShouldRetry returns false) to guard against regressions, not just that AddExpression("") succeeds and the manager is non-nil.

 err := manager.AddExpression("")
 assert.NoError(t, err)
 assert.NotNil(t, manager)
+res, evalErr := manager.ShouldRetry(RetryContext{}, "")
+assert.NoError(t, evalErr)
+assert.False(t, res)

528-551: Parallelize subtests to speed up and isolate failures (optional)

You can mark the top-level test and subtests with t.Parallel() for faster runs; table-driven subtests here are good candidates.

 func TestRetryExpressionManager_WithSyscallErrors(t *testing.T) {
+  t.Parallel()
   t.Run("expression combining status and syscall errors", func(t *testing.T) {
+    t.Parallel()
router/core/router.go (3)

187-206: SubgraphRetryOptions: shape and IsEnabled() look good

Structure and enablement logic are sound; nil map iteration is safe. Minor: consider doc comments on fields for clarity (global “All” vs per-subgraph overrides).


1908-1923: Comment mismatch: “circuit breakers” → “retry options”

The comment under NewSubgraphRetryOptions mentions circuit breakers; it should refer to retry options.

- // Subgraph specific circuit breakers
+ // Subgraph-specific retry options

1924-1937: Default algorithm only when retries are enabled (aligns with team preference)

Per team learning, when retries are disabled we shouldn’t “enforce” algorithm defaults. Guard the defaulting by Enabled to better reflect that stance.

-func newRetryConfig(config config.BackoffJitterRetry) retrytransport.RetryOptions {
-    algorithm := config.Algorithm
-    if algorithm == "" {
-        algorithm = retrytransport.BackoffJitter
-    }
+func newRetryConfig(config config.BackoffJitterRetry) retrytransport.RetryOptions {
+    algorithm := config.Algorithm
+    if config.Enabled && algorithm == "" {
+        algorithm = retrytransport.BackoffJitter
+    }
     return retrytransport.RetryOptions{
         Enabled:       config.Enabled,
         Algorithm:     algorithm,
         MaxRetryCount: config.MaxAttempts,
         MaxDuration:   config.MaxDuration,
         Interval:      config.Interval,
         Expression:    config.Expression,
     }
 }
router/core/retry_builder.go (1)

15-23: Defensive nil check for req (optional)

If ShouldRetry ever gets a nil req from the transport layer, req.Context() will panic. Cheap guard:

-reqContext := getRequestContext(req.Context())
+if req == nil {
+  return false
+}
+reqContext := getRequestContext(req.Context())
router-tests/retry_test.go (1)

76-78: Avoid brittle full-JSON equality in assertions.

These assertions will break on harmless message changes. Prefer checking essential fields (e.g., statusCode) or comparing decoded JSON structures.
Example:

- require.Equal(t, `{"errors":[{"message":"Failed to fetch from Subgraph 'employees', Reason: empty response.","extensions":{"statusCode":502}}],"data":{"employees":null}}`, res.Body)
+ require.JSONEq(t, `{"data":{"employees":null}}`, res.Body)
+ require.Contains(t, res.Body, `"statusCode":502`)

Also applies to: 127-129, 180-182, 296-298, 360-372

router/internal/retrytransport/retry_transport_test.go (2)

268-298: Add a zero-retries guard test.

Add a case with MaxRetryCount=0 to assert no retries occur and OnRetry isn’t called.

Sketch:

t.Run("no retries when MaxRetryCount=0", func(t *testing.T) {
  retries, attempts := 0, 0
  mgr := newTestManager(simpleShouldRetry, func(int, *http.Request, *http.Response, time.Duration, error) { retries++ },
    RetryOptions{Enabled: true, MaxRetryCount: 0, Interval: time.Millisecond, MaxDuration: time.Second}, "sg")
  tr := NewRetryHTTPTransport(&MockTransport{handler: func(*http.Request) (*http.Response, error) {
    attempts++; return nil, errors.New("fail")
  }}, func(*http.Request) *zap.Logger { return zap.NewNop() }, mgr)
  _, err := tr.RoundTrip(withSubgraph(httptest.NewRequest("GET", "http://x", nil), "sg"))
  assert.Error(t, err)
  assert.Equal(t, 0, retries)
  assert.Equal(t, 1, attempts)
})

383-414: Log-string assertions are inherently brittle.

If wording changes, tests fail. Consider matching on a stable key/field or use zap observer to assert fields instead of formatted strings.

router/core/retry_builder_test.go (4)

52-57: Test name doesn’t match behavior.

No “disabled” path is exercised here; the test only builds the function. Rename to “build function returns non-nil with empty manager” or add a real disabled-case.

-t.Run("build function when retry is disabled", func(t *testing.T) {
+t.Run("build function returns non-nil with empty manager", func(t *testing.T) {

176-189: Clarify EOF comment.

Comment says “EOF is now handled at transport layer,” but this test expects true here via default retryable error handling. Tweak the comment to avoid confusion.

- // EOF is now handled at transport layer, not expression
+ // EOF can be retried via default transport-level detection; expression is false here

442-485: Remove stale, commented-out tests.

Dead code makes maintenance harder. Either port to the new manager-based API or delete.

-// TODO: Fix
-//func TestProcessRetryOptions(t *testing.T) {
-//  ... (entire commented block)
-//}
+// (removed legacy ProcessRetryOptions tests; superseded by expression-manager and manager-based flows)

51-57: Add a “no request context” guard test.

BuildRetryFunction returns false when request context is missing. Add a unit test to lock this in.

Sketch:

t.Run("returns false when request context missing", func(t *testing.T) {
  manager := expr.NewRetryExpressionManager()
  _ = manager.AddExpression("true")
  fn, _ := BuildRetryFunction(manager)
  req, _ := http.NewRequest("POST", "http://example.com/graphql", nil) // no ctx attached
  assert.False(t, fn(nil, req, &http.Response{StatusCode: 500}, "true"))
})

Want me to open a follow-up PR adding this?

Also applies to: 139-146

router/internal/retrytransport/manager.go (2)

88-91: Remove unreachable config-not-found branch.

customSgNames is only populated when a subgraph has an entry in subgraphRetryOptions; this branch cannot trigger.

-		entry, ok := subgraphRetryOptions[sgName]
-		if !ok {
-			joinErr = errors.Join(joinErr, errors.New("retry config not found for subgraph "+sgName))
-			continue
-		}
+		entry := subgraphRetryOptions[sgName]

118-121: Rename local variable for clarity.

Avoid “circuitBreaker” in retry code.

-	if circuitBreaker, ok := c.retries[name]; ok {
-		return circuitBreaker
+	if opts, ok := c.retries[name]; ok {
+		return opts
router/internal/retrytransport/retry_transport.go (2)

145-151: Option: drain before sleeping to maximize connection reuse.

Draining immediately frees the connection for reuse while we wait.

-		// drain the previous response before retrying
-		rt.drainBody(resp, requestLogger)
-
-		// Retry the request
-		resp, err = rt.roundTripper.RoundTrip(req)
+		// drain the previous response before waiting/retrying
+		rt.drainBody(resp, requestLogger)
+		// Retry the request
+		resp, err = rt.roundTripper.RoundTrip(req)

50-56: Log level nit: malformed Retry-After.

Consider Warn instead of Error to reduce noise for client-driven header issues.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 67c417f and d86e5cc.

📒 Files selected for processing (17)
  • router-tests/error_handling_test.go (4 hunks)
  • router-tests/panic_test.go (2 hunks)
  • router-tests/retry_test.go (10 hunks)
  • router-tests/structured_logging_test.go (5 hunks)
  • router-tests/telemetry/telemetry_test.go (2 hunks)
  • router/core/graph_server.go (4 hunks)
  • router/core/retry_builder.go (2 hunks)
  • router/core/retry_builder_test.go (15 hunks)
  • router/core/router.go (3 hunks)
  • router/core/router_config.go (2 hunks)
  • router/core/supervisor_instance.go (1 hunks)
  • router/core/transport.go (6 hunks)
  • router/internal/expr/retry_expression.go (2 hunks)
  • router/internal/expr/retry_expression_test.go (2 hunks)
  • router/internal/retrytransport/manager.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (4 hunks)
  • router/internal/retrytransport/retry_transport_test.go (33 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/core/supervisor_instance.go
  • router/core/router.go
  • router-tests/telemetry/telemetry_test.go
  • router/core/router_config.go
  • router-tests/retry_test.go
  • router-tests/error_handling_test.go
  • router-tests/structured_logging_test.go
  • router-tests/panic_test.go
📚 Learning: 2025-08-28T10:07:27.637Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.

Applied to files:

  • router/core/supervisor_instance.go
  • router/core/router.go
  • router-tests/retry_test.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
  • router-tests/error_handling_test.go
  • router-tests/structured_logging_test.go
  • router-tests/panic_test.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
📚 Learning: 2025-08-26T17:30:41.212Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/retry_expression_builder.go:3-11
Timestamp: 2025-08-26T17:30:41.212Z
Learning: In router/core/retry_expression_builder.go, the team prefers to keep EOF error detection simple with string matching ("unexpected eof") rather than using errors.Is() for io.EOF and io.ErrUnexpectedEOF. They want to wait for customer feedback before making it more robust, since customers can modify retry expressions to handle specific cases.

Applied to files:

  • router/internal/expr/retry_expression_test.go
  • router/core/retry_builder_test.go
  • router/core/retry_builder.go
📚 Learning: 2025-08-28T09:59:19.956Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/supervisor_instance.go:196-204
Timestamp: 2025-08-28T09:59:19.956Z
Learning: In router/pkg/config/config.go, the BackoffJitterRetry struct uses envDefault tags to provide default values for Algorithm ("backoff_jitter") and Expression ("IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"), ensuring that even when YAML configs omit these fields, valid defaults are applied automatically, preventing ProcessRetryOptions from rejecting empty values.

Applied to files:

  • router/core/retry_builder.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.

Applied to files:

  • router/internal/retrytransport/retry_transport_test.go
🧬 Code graph analysis (17)
router/core/supervisor_instance.go (1)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1776-1780)
  • NewSubgraphRetryOptions (1908-1922)
router/core/router.go (2)
router/internal/retrytransport/manager.go (3)
  • RetryOptions (25-32)
  • OnRetryFunc (17-17)
  • BackoffJitter (22-22)
router/pkg/config/config.go (2)
  • TrafficShapingRules (164-171)
  • BackoffJitterRetry (231-238)
router-tests/telemetry/telemetry_test.go (2)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1776-1780)
  • NewSubgraphRetryOptions (1908-1922)
router/pkg/config/config.go (3)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
router/internal/expr/retry_expression_test.go (2)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (19-23)
router/internal/expr/retry_context.go (1)
  • LoadRetryContext (132-146)
router/core/router_config.go (1)
router/core/router.go (1)
  • SubgraphRetryOptions (187-191)
router/internal/expr/retry_expression.go (1)
router/internal/expr/retry_context.go (1)
  • RetryContext (13-19)
router-tests/retry_test.go (4)
router/core/router.go (3)
  • NewSubgraphRetryOptions (1908-1922)
  • WithSubgraphRetryOptions (1776-1780)
  • Option (172-172)
router/pkg/config/config.go (4)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
  • Config (985-1059)
router/internal/retrytransport/manager.go (1)
  • OnRetryFunc (17-17)
router-tests/testenv/testenv.go (4)
  • Run (105-122)
  • RunWithError (135-150)
  • Environment (1727-1763)
  • SubgraphsConfig (365-378)
router-tests/error_handling_test.go (2)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1776-1780)
  • NewSubgraphRetryOptions (1908-1922)
router/pkg/config/config.go (3)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
router/core/graph_server.go (4)
router/internal/retrytransport/manager.go (3)
  • Manager (34-40)
  • NewManager (42-49)
  • OnRetryFunc (17-17)
router/internal/expr/expr_manager.go (1)
  • Manager (15-17)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (19-23)
router/core/retry_builder.go (1)
  • BuildRetryFunction (13-46)
router-tests/structured_logging_test.go (2)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1776-1780)
  • NewSubgraphRetryOptions (1908-1922)
router/pkg/config/config.go (3)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
router/internal/retrytransport/manager.go (2)
router/internal/expr/expr.go (1)
  • Request (65-74)
router/internal/expr/retry_expression.go (1)
  • RetryExpressionManager (12-14)
router/core/retry_builder_test.go (3)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (19-23)
router/core/retry_builder.go (1)
  • BuildRetryFunction (13-46)
router/core/context.go (2)
  • OperationTypeQuery (488-488)
  • OperationTypeMutation (489-489)
router/internal/retrytransport/retry_transport.go (3)
router/internal/retrytransport/manager.go (1)
  • Manager (34-40)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/context/keys.go (1)
  • CurrentSubgraphContextKey (7-7)
router/core/retry_builder.go (2)
router/internal/expr/retry_expression.go (1)
  • RetryExpressionManager (12-14)
router/internal/retrytransport/manager.go (1)
  • ShouldRetryFunc (16-16)
router/internal/retrytransport/retry_transport_test.go (5)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/context/keys.go (1)
  • CurrentSubgraphContextKey (7-7)
router/internal/retrytransport/manager.go (4)
  • OnRetryFunc (17-17)
  • RetryOptions (25-32)
  • Manager (34-40)
  • NewManager (42-49)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (19-23)
router/internal/retrytransport/retry_transport.go (1)
  • NewRetryHTTPTransport (77-87)
router-tests/panic_test.go (2)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1776-1780)
  • NewSubgraphRetryOptions (1908-1922)
router/pkg/config/config.go (3)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
router/core/transport.go (2)
router/internal/retrytransport/manager.go (1)
  • Manager (34-40)
router/internal/retrytransport/retry_transport.go (1)
  • NewRetryHTTPTransport (77-87)
🔇 Additional comments (23)
router/core/router_config.go (1)

79-79: Switch to pointer: ensure initialization path or guard against nil

Changing retryOptions to *SubgraphRetryOptions is fine, but it increases the chance of nil deref at use sites unless always initialized.

Please confirm that core.Config.retryOptions is assigned during router setup. Currently, WithSubgraphRetryOptions sets r.retryOptions (see router/core/router.go), but I don’t see where c.retryOptions (this struct) is populated. If not populated, prefer guarding in Usage() (see below) or setting it alongside the router field.

router/core/supervisor_instance.go (1)

196-197: Guard disabled subgraph retries before defaulting algorithm
In NewSubgraphRetryOptions (router/core/router.go ¶1909–1921), you always call newRetryConfig for each subgraph, which unconditionally defaults an empty algorithm—even when Enabled==false. Per our retry-disabled rule, skip both the call and algorithm default for subgraphs with v.BackoffJitterRetry.Enabled==false. Wrap the loop entry in an if v.BackoffJitterRetry.Enabled check so disabled rules retain an empty algorithm.

⛔ Skipped due to learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/supervisor_instance.go:196-204
Timestamp: 2025-08-28T09:59:19.956Z
Learning: In router/pkg/config/config.go, the BackoffJitterRetry struct uses envDefault tags to provide default values for Algorithm ("backoff_jitter") and Expression ("IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"), ensuring that even when YAML configs omit these fields, valid defaults are applied automatically, preventing ProcessRetryOptions from rejecting empty values.
router-tests/panic_test.go (1)

51-55: Good: retries explicitly disabled for panic-path stability

Switching to core.NewSubgraphRetryOptions(... Enabled: false ...) ensures the router won’t add subgraph retries that could mask panic timing in these tests. Matches the disabled-validation expectation.

Also applies to: 87-91

router/core/graph_server.go (3)

43-43: Import looks correct

retrytransport is used below; import is appropriate.


104-105: Field addition LGTM

Storing the retry manager on graphServer is fine and consistent with how other managers (circuit) are handled.


1208-1209: Wiring RetryManager into transport options is correct

ExecutorConfigurationBuilder receives the manager; assuming transport handles nil, this is safe.

If not already covered in tests, add one integration that omits WithSubgraphRetryOptions entirely to ensure nil RetryManager is handled gracefully.

router/internal/expr/retry_expression_test.go (2)

125-135: Manager-based API usage looks correct

Instantiating the manager, pre-registering expressions with AddExpression, and evaluating via ShouldRetry(ctx, expr) matches the new API. Good assertions on compile-time errors for invalid expressions.


496-525: Good syscall coverage with the manager path

Nice coverage for ECONNREFUSED/ECONNRESET/ETIMEDOUT under the manager-evaluated expression. This ensures helper wiring remains intact after the API shift.

router/core/router.go (1)

1776-1779: Confirmed retryOptions field exists on Router.Config
Router embeds Config, which declares retryOptions *SubgraphRetryOptions, so the WithSubgraphRetryOptions setter compiles as expected.

router/core/retry_builder.go (1)

55-58: EOF-as-default-retryable matches stated team preference

Keeping this simple with string matching for “unexpected eof” is consistent with prior guidance to wait for customer feedback before broadening detection.

router-tests/retry_test.go (4)

38-51: Good adoption of builder-style retry options.

Switching to NewSubgraphRetryOptions + WithSubgraphRetryOptions makes tests clearer and mirrors the new API.


247-306: Exact sleep equality on Retry-After is OK here.

Asserting sleepDuration equals header seconds is deterministic since jitter shouldn’t apply when Retry-After is honored.


649-674: Great coverage of mixed global + per-subgraph overrides.

Validates default-from-All and targeted override behavior cleanly.

Also applies to: 676-719


363-371: Unable to run flakiness check in sandbox
The go test invocation failed due to missing Go module context. Please run locally and confirm stability:

go test ./router-tests -run 'TestFlakyRetry/verify max duration is not exceeded on intervals$' -count=50 -v

If you observe intermittent failures, remove t.Parallel() from this subtest or widen the timing tolerances around the expected duration.

router/internal/retrytransport/retry_transport_test.go (3)

58-61: withSubgraph helper looks correct.

Uses the typed context key; safe and explicit.


204-231: Solid body-draining tests.

Covers read and close failure paths with log assertions. Good observability checks.

Also applies to: 300-357


483-538: Thorough 429 coverage (seconds, HTTP-date, invalid, negative, mixed).

Great breadth; confirms shouldUseRetryAfter behavior and fallbacks.

Also applies to: 540-596, 598-636, 638-697, 699-741, 742-782, 784-863

router/core/retry_builder_test.go (3)

60-85: Default expression behavior assertions look good.

Covers 5xx and default network errors.


241-270: Mutation short-circuit behavior well covered.

Ensures expressions can’t override the “no retry on mutations” rule.


390-439: Comprehensive positive matrix + mutation negative matrix.

Good end-to-end validation of the expression path.

router/internal/retrytransport/manager.go (1)

61-69: LGTM on skipping algorithm validation when retries are disabled.

Disabled per-subgraph entries are excluded from validation, aligning with the prior decision captured in learnings.

router/core/transport.go (2)

105-107: Wiring to Manager-based retry looks correct.

Conditional wrapping with NewRetryHTTPTransport is clean and preserves the base round tripper path when disabled.


73-75: All call sites updated. Verified NewCustomTransport/NewRetryHTTPTransport and RetryManager usage in router/core/transport.go (lines 106, 367, 388, 428) and router/core/graph_server.go (line 1208).

Comment thread router-tests/retry_test.go
Comment thread router-tests/retry_test.go
Comment thread router/core/retry_builder.go Outdated
Comment thread router/core/router_config.go
Comment thread router/internal/retrytransport/manager.go Outdated
Comment thread router/internal/retrytransport/manager.go
Comment thread router/internal/retrytransport/manager.go
Comment thread router/internal/retrytransport/manager.go Outdated
Comment thread router/internal/retrytransport/retry_transport_test.go
Comment thread router/internal/retrytransport/retry_transport.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/internal/retrytransport/retry_transport.go (1)

145-151: Rewind request body before retrying; otherwise subsequent attempts may send an empty/partial body.

GraphQL subgraph calls are typically POST with bodies. Without resetting Body via GetBody, retries can be broken.

 		// drain the previous response before retrying
 		rt.drainBody(resp, requestLogger)
 
 		// Retry the request
-		resp, err = rt.roundTripper.RoundTrip(req)
+		// Reset request body if necessary
+		if req.Body != nil && req.Body != http.NoBody {
+			if req.GetBody == nil {
+				requestLogger.Warn("Cannot retry request with non-rewindable body")
+				break
+			}
+			newBody, gbErr := req.GetBody()
+			if gbErr != nil {
+				requestLogger.Error("Failed to reset request body for retry", zap.Error(gbErr))
+				return resp, gbErr
+			}
+			req.Body = newBody
+		}
+		resp, err = rt.roundTripper.RoundTrip(req)
♻️ Duplicate comments (3)
router-tests/retry_test.go (3)

503-531: Great: base-level invalid algorithm is correctly rejected at init.
Matches expected behavior and error text.


533-558: Nice: invalid algorithm is ignored when retries are disabled (base).
This addresses the previously requested coverage.


560-587: Nice: per-subgraph “disabled ignores algorithm” case covered.
Prevents regressions in validation gating.

🧹 Nitpick comments (14)
router-tests/retry_test.go (8)

38-51: DRY the options setup with a small helper to reduce duplication.
Multiple tests repeat the same NewSubgraphRetryOptions + WithSubgraphRetryOptions pattern; a local builder will simplify maintenance.

Apply this helper once near the top and replace repeated blocks:

+func withRetryOpts(cfg config.TrafficShapingRules, onRetry func(int, *http.Request, *http.Response, time.Duration, error)) core.Option {
+  opts := core.NewSubgraphRetryOptions(cfg)
+  opts.OnRetryFunc = onRetry
+  return core.WithSubgraphRetryOptions(opts)
+}

Also applies to: 92-105, 145-158, 199-212, 260-273, 324-337, 391-404, 451-464


224-231: Decouple success trigger from OnRetry callback semantics.
Using onRetryCounter to flip success couples the test to callback timing. Drive success via request attempt count to avoid false positives if OnRetry invocation timing changes.

- if onRetryCounter.Load() == int32(maxAttemptsBeforeServiceSucceeds) {
+ if serviceCallsCounter.Load() == int32(maxAttemptsBeforeServiceSucceeds) {
    w.WriteHeader(http.StatusOK)
    _, _ = w.Write([]byte(`{"data":{"employees":[{"id":1},{"id":2}]}}`))
  } else {
    w.WriteHeader(http.StatusBadGateway)
  }
  serviceCallsCounter.Add(1)

256-257: Speed up the 429 + Retry-After test.
This test currently sleeps ~3s (3 retries × 1s). One retry is sufficient to validate header precedence.

- maxRetryCount := 3
+ maxRetryCount := 1

433-434: Avoid flakiness: assert a range instead of NotEqual.
Exact equality with retryInterval can sporadically occur depending on jitter/rounding.

- require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))
+ d := time.Duration(sleepDuration.Load())
+ require.Greater(t, d, time.Duration(0))
+ require.Less(t, d, 2*retryInterval)

494-495: Same here: use bounded checks to reduce flakiness.
Zero Retry-After should fall back to jitter/backoff; prefer range assertions.

- require.NotEqual(t, sleepDuration.Load(), int64(retryInterval))
+ d := time.Duration(sleepDuration.Load())
+ require.Greater(t, d, time.Duration(0))
+ require.Less(t, d, 2*retryInterval)

360-362: Prefer JSON structural equality over string equality for response bodies.
String comparisons are brittle to formatting/ordering changes.

- require.Equal(t, `{"errors":[{"message":"Failed to fetch from Subgraph 'employees', Reason: empty response.","extensions":{"statusCode":502}}],"data":{"employees":null}}`, res.Body)
+ require.JSONEq(t, `{"errors":[{"message":"Failed to fetch from Subgraph 'employees', Reason: empty response.","extensions":{"statusCode":502}}],"data":{"employees":null}}`, res.Body)

(Apply similarly to the other exact body checks.)

Also applies to: 681-685, 761-763


366-372: Widen timing slack to reduce CI flakes.
A 20ms/100ms margin can be tight under load; slightly larger buffers help.

- shouldBeLessThanDuration := (time.Duration(maxRetryCount-1) * retryInterval) - (20 * time.Millisecond)
+ shouldBeLessThanDuration := (time.Duration(maxRetryCount-1) * retryInterval) - (50 * time.Millisecond)
...
- expectedMinDuration := (time.Duration(maxRetryCount-1) * maxDuration) - (100 * time.Millisecond)
+ expectedMinDuration := (time.Duration(maxRetryCount-1) * maxDuration) - (150 * time.Millisecond)

621-697: Per-subgraph retry counts validated correctly.
Isolated counters for employees vs test1 look good.

If useful, I can add a combined query hitting both subgraphs in a single operation to verify independent retry accounting within one request.

router/internal/retrytransport/retry_transport.go (6)

142-144: Honor context cancellation/deadlines during backoff sleep.

Avoid sleeping past ctx deadlines; exit early if canceled.

-		// Wait for the specified duration
-		time.Sleep(sleepDuration)
+		// Wait for the specified duration or context cancellation
+		if sleepDuration > 0 {
+			select {
+			case <-req.Context().Done():
+				return resp, req.Context().Err()
+			case <-time.After(sleepDuration):
+			}
+		}

59-66: Also honor Retry-After for 503 Service Unavailable.

RFC 7231 allows Retry-After on 503 responses as well as 429. This improves interoperability with upstreams.

-	if resp == nil || resp.StatusCode != http.StatusTooManyRequests {
+	if resp == nil || (resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable) {
 		return 0, false
 	}

124-128: Generic log for Retry-After usage (it may be 429 or 503).

-			requestLogger.Debug("Using Retry-After header for 429 response",
+			requestLogger.Debug("Using Retry-After header",
 				zap.Int("retry", retries),
 				zap.String("url", req.URL.String()),
 				zap.Duration("retry-after", sleepDuration),
 			)

51-53: Lower log level for malformed Retry-After headers.

Header parsing errors are client-observable but non-fatal; debug (or warn) is less noisy in production.

-		logger.Error("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
+		logger.Debug("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))

173-176: Tighten the comment wording.

-	// When we close the body only will go internally marks the persisted connection as true
-	// which is important so that it can reuse the connection internally for retrying
+	// Draining then closing marks the connection as reusable, allowing the client to reuse it for retries.

105-111: Optional: short-circuit when retries are effectively disabled.

If MaxRetryCount == 0 (or options are disabled), return immediately to avoid backoff setup and logging overhead. Note: per team learning, IsEnabled() handles nil receivers safely if available.

 	retryOptions := rt.retryManager.GetSubgraphOptions(subgraph)
-	if retryOptions == nil {
-		return resp, nil
-	}
+	if retryOptions == nil {
+		return resp, err
+	}
+	// Optional fast-exit:
+	// if retryOptions.MaxRetryCount == 0 { return resp, err }
+	// or if available:
+	// if !retryOptions.IsEnabled() { return resp, err }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d86e5cc and 79cdb75.

📒 Files selected for processing (3)
  • router-tests/retry_test.go (10 hunks)
  • router/internal/retrytransport/manager.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/internal/retrytransport/manager.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.

Applied to files:

  • router-tests/retry_test.go
  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router-tests/retry_test.go
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().

Applied to files:

  • router-tests/retry_test.go
  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-28T10:07:27.637Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.

Applied to files:

  • router-tests/retry_test.go
🧬 Code graph analysis (2)
router-tests/retry_test.go (4)
router/core/router.go (3)
  • NewSubgraphRetryOptions (1908-1922)
  • WithSubgraphRetryOptions (1776-1780)
  • Option (172-172)
router/pkg/config/config.go (4)
  • TrafficShapingRules (164-171)
  • GlobalSubgraphRequestRule (190-207)
  • BackoffJitterRetry (231-238)
  • Config (985-1059)
router/internal/retrytransport/manager.go (1)
  • OnRetryFunc (17-17)
router-tests/testenv/testenv.go (4)
  • Run (105-122)
  • RunWithError (135-150)
  • Environment (1727-1763)
  • SubgraphsConfig (365-378)
router/internal/retrytransport/retry_transport.go (3)
router/internal/retrytransport/manager.go (1)
  • Manager (34-40)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/context/keys.go (1)
  • CurrentSubgraphContextKey (7-7)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: image_scan
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (4)
router-tests/retry_test.go (4)

10-13: Good import adjustments for new API usage.
The added imports align with the builder-style options and testify assertions.


16-23: OnRetry counter helper is concise and thread-safe.
Atomic counters and nil-guard for duration are correct.


589-619: Per-subgraph invalid algorithm correctly fails fast.
Solid negative test to mirror the base-level case.


699-775: Mixed global + per-subgraph overrides behave as intended.
Employees inherit global, test1 overrides; counters confirm isolation.

Comment thread router/internal/retrytransport/retry_transport.go Outdated
Comment thread router/internal/retrytransport/retry_transport.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/internal/retrytransport/retry_transport.go (1)

147-152: Critical: retries will resend a consumed request body

On retry, req.Body is already consumed. If req.GetBody is unset, retries are unsafe; if set, reset Body before each retry. Otherwise you’ll send an empty body or fail.

Apply:

-		// Wait for the specified duration
-		time.Sleep(sleepDuration)
-
-		// drain the previous response before retrying
-		rt.drainBody(resp, requestLogger)
+		// Drain previous response ASAP to free the connection for reuse
+		rt.drainBody(resp, requestLogger)
+
+		// Respect request cancellation while waiting
+		if sleepDuration > 0 {
+			timer := time.NewTimer(sleepDuration)
+			select {
+			case <-req.Context().Done():
+				timer.Stop()
+				return resp, req.Context().Err()
+			case <-timer.C:
+			}
+		}
+
+		// Rewind request body if present
+		if req.Body != nil {
+			if req.GetBody == nil {
+				requestLogger.Debug("cannot retry: non-rewindable request body")
+				break
+			}
+			if newBody, gErr := req.GetBody(); gErr != nil {
+				return resp, gErr
+			} else {
+				req.Body = newBody
+			}
+		}
 
 		// Retry the request
 		resp, err = rt.roundTripper.RoundTrip(req)
♻️ Duplicate comments (1)
router/internal/retrytransport/retry_transport.go (1)

104-108: Nice fix: avoids a second request when no retry options

Returning the outcome of the first attempt preserves semantics and errors.

🧹 Nitpick comments (5)
router/internal/retrytransport/retry_transport.go (5)

58-75: Honor Retry-After for 503 as well and accept zero-second delays

RFC 7231 defines Retry-After for 503 and 429. Also treat “0” as a valid immediate retry (no sleep), not a fallback to random backoff.

Apply:

-func shouldUseRetryAfter(logger *zap.Logger, resp *http.Response, maxDuration time.Duration) (time.Duration, bool) {
-	if resp == nil || resp.StatusCode != http.StatusTooManyRequests {
+func shouldUseRetryAfter(logger *zap.Logger, resp *http.Response, maxDuration time.Duration) (time.Duration, bool) {
+	if resp == nil || (resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable) {
 		return 0, false
 	}
@@
-	return duration, duration > 0
+	// Accept immediate retry (0s) as valid
+	return duration, duration >= 0

22-56: Minor: trim header value before parsing; lower log level on parse failures

Real-world proxies sometimes add whitespace; parsing “ 120 ” should work. Also consider debug level to reduce noise.

Apply:

 import (
 	"errors"
 	"io"
 	"net/http"
+	"strings"
 	"strconv"
 	"time"
@@
-func parseRetryAfterHeader(logger *zap.Logger, retryAfter string) time.Duration {
+func parseRetryAfterHeader(logger *zap.Logger, retryAfter string) time.Duration {
 	if retryAfter == "" {
 		return 0
 	}
 
 	var errJoin error
 
-	seconds, err := strconv.Atoi(retryAfter)
+	retryAfter = strings.TrimSpace(retryAfter)
+	seconds, err := strconv.Atoi(retryAfter)
@@
-	if errJoin != nil {
-		logger.Error("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
-	}
+	if errJoin != nil {
+		logger.Debug("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
+	}

144-149: Drain before sleeping to free connections earlier

Draining before the wait reduces connection pinning during long sleeps.

(Already included in the diff above.)


183-187: Guard against nil resp in helpers (defensive)

Not strictly necessary given current call sites, but adding a nil check makes the helper safer.

Apply:

 func isResponseOK(resp *http.Response) bool {
+	if resp == nil {
+		return false
+	}
 	// Ensure we don't wait for no reason when subgraphs don't behave
 	// spec-compliant and returns a different status code than 200.
 	return resp.StatusCode >= 200 && resp.StatusCode < 300
 }

175-181: Polish the comment grammar

Improve clarity of the connection reuse note.

Apply:

-	// When we close the body only will go internally marks the persisted connection as true
-	// which is important so that it can reuse the connection internally for retrying
+	// Fully consuming and closing the body marks the connection reusable
+	// so the Transport can keep-alive and reuse it for retries.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 79cdb75 and 3344631.

📒 Files selected for processing (2)
  • router/internal/retrytransport/manager.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/internal/retrytransport/manager.go
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false as the first check, so no external nil checks are needed when calling c.retryOptions.IsEnabled(). This is a consistent pattern used across multiple IsEnabled() methods in the codebase.

Applied to files:

  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-09-01T13:44:13.033Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/router_config.go:221-221
Timestamp: 2025-09-01T13:44:13.033Z
Learning: The IsEnabled() method on SubgraphRetryOptions in router/core/router.go safely handles nil receivers by returning false, so no external nil checks are needed when calling c.retryOptions.IsEnabled().

Applied to files:

  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-09-01T13:26:16.245Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2172
File: router/core/retry_builder.go:13-15
Timestamp: 2025-09-01T13:26:16.245Z
Learning: In router/core/retry_builder.go, the BuildRetryFunction's manager parameter is guaranteed to be non-nil by the calling code, so defensive nil checks are not needed according to the team's preference.

Applied to files:

  • router/internal/retrytransport/retry_transport.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/internal/retrytransport/retry_transport.go
🧬 Code graph analysis (1)
router/internal/retrytransport/retry_transport.go (3)
router/internal/retrytransport/manager.go (1)
  • Manager (34-40)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/context/keys.go (1)
  • CurrentSubgraphContextKey (7-7)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: Analyze (go)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
🔇 Additional comments (5)
router/internal/retrytransport/retry_transport.go (5)

16-20: Good: transport now composes roundTripper and retryManager cleanly

Field injection improves testability and decoupling.


77-87: Constructor wiring looks solid

Signature change and assignments are consistent.


139-142: OnRetry hook usage is correct

API matches Manager.OnRetry(retries, req, resp, sleepDuration, err).


110-111: Usage correct: backoff.New expects (max, interval) signature
The arguments retryOptions.MaxDuration (max) and retryOptions.Interval (interval) align with func New(max time.Duration, interval time.Duration) *Backoff; no changes needed.


96-103: Verify CurrentSubgraphContextKey value type
Ensure that any value stored under rcontext.CurrentSubgraphContextKey is always a string; adjust the type assertion if other types are ever used.

Comment thread router/internal/retrytransport/retry_transport.go
@SkArchon SkArchon changed the title feat: retry per subgraphs feat: specify retry configuration overrides per subgraph Sep 1, 2025
@SkArchon SkArchon requested a review from endigma as a code owner September 2, 2025 11:23
Comment thread router-tests/security/retry_test.go
Comment thread router/core/retry_builder_test.go Outdated
Comment thread router/internal/retrytransport/manager.go Outdated
@SkArchon SkArchon requested a review from StarpTech September 3, 2025 13:10
@SkArchon
Copy link
Copy Markdown
Contributor Author

@coderabbitai pause

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 17, 2025

✅ Actions performed

Reviews paused.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Oct 2, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions Bot added the Stale label Oct 2, 2025
@github-actions
Copy link
Copy Markdown

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions Bot closed this Oct 16, 2025
@SkArchon SkArchon reopened this Oct 16, 2025
@SkArchon
Copy link
Copy Markdown
Contributor Author

@StarpTech Could I get a re-review (remaining on my side to fix circuit breaker test)

@github-actions github-actions Bot removed the Stale label Oct 17, 2025
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Nov 1, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions Bot added the Stale label Nov 1, 2025
@github-actions
Copy link
Copy Markdown

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions Bot closed this Nov 15, 2025
@SkArchon SkArchon reopened this Apr 7, 2026
@SkArchon SkArchon marked this pull request as draft April 7, 2026 11:42
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 87.28324% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.82%. Comparing base (09e4653) to head (65b1db8).
⚠️ Report is 43 commits behind head on main.

Files with missing lines Patch % Lines
router/internal/retrytransport/manager.go 87.30% 4 Missing and 4 partials ⚠️
router/core/transport.go 68.42% 4 Missing and 2 partials ⚠️
router/internal/expr/retry_expression.go 85.00% 1 Missing and 2 partials ⚠️
router/core/router.go 92.00% 1 Missing and 1 partial ⚠️
router/internal/retrytransport/retry_transport.go 88.88% 1 Missing and 1 partial ⚠️
router/core/supervisor_instance.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2172      +/-   ##
==========================================
+ Coverage   63.46%   65.82%   +2.36%     
==========================================
  Files         251      255       +4     
  Lines       26767    26482     -285     
==========================================
+ Hits        16987    17432     +445     
+ Misses       8416     7652     -764     
- Partials     1364     1398      +34     
Files with missing lines Coverage Δ
router/core/engine_loader_hooks.go 91.92% <ø> (-0.05%) ⬇️
router/core/graph_server.go 82.42% <100.00%> (-2.64%) ⬇️
router/core/retry_builder.go 65.21% <100.00%> (-15.18%) ⬇️
router/core/router_config.go 93.75% <100.00%> (ø)
router/internal/circuit/breaker.go 91.30% <100.00%> (-1.29%) ⬇️
router/internal/traceclient/traceclient.go 96.42% <100.00%> (-0.24%) ⬇️
router/core/supervisor_instance.go 0.00% <0.00%> (ø)
router/core/router.go 70.51% <92.00%> (+0.40%) ⬆️
router/internal/retrytransport/retry_transport.go 97.64% <88.88%> (-2.36%) ⬇️
router/internal/expr/retry_expression.go 75.75% <85.00%> (+3.03%) ⬆️
... and 2 more

... and 20 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions github-actions Bot removed the Stale label Apr 8, 2026
@SkArchon SkArchon marked this pull request as ready for review April 9, 2026 11:36
@SkArchon SkArchon force-pushed the milinda/retry-per-subgraphs branch from 82c297e to 4f52652 Compare April 21, 2026 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants